Versioning Example (Part 1/3)

In this example, we'll train an NLP model for sentiment analysis of tweets using spaCy.

Throughout this series, we'll take advantage of ModelDB's versioning system to keep track of changes.

This workflow requires verta>=0.14.1 and spaCy>=2.0.0.
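
If these aren't already installed, they can be pulled from PyPI, e.g.:

!pip install "verta>=0.14.1" "spacy>=2.0.0"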


Setup

Download a pre-trained spaCy model, which we'll fine-tune later.


In [1]:
!python -m spacy download en_core_web_sm


✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')

Import the libraries we'll need.


In [2]:
from __future__ import unicode_literals, print_function

import boto3
import json
import numpy as np
import pandas as pd
import spacy

Bring in Verta's ModelDB client to organize our work, and to log and version metadata.


In [3]:
from verta import Client

client = Client('https://app.verta.ai')
proj = client.set_project('Tweet Classification')
expt = client.set_experiment('SpaCy')


set email from environment
set developer key from environment
connection successfully established
set existing Project: Tweet Classification from personal workspace
set existing Experiment: SpaCy
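
As the output notes, credentials are read from the environment (the VERTA_EMAIL and VERTA_DEV_KEY variables). They could equally be passed to the client explicitly; a sketch, with placeholder values:

client = Client(
    'https://app.verta.ai',
    email='you@example.com',           # placeholder
    dev_key='<your developer key>',    # placeholder
)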

Prepare Data

Download a dataset of English tweets from S3 to train on.


In [4]:
S3_BUCKET = "verta-starter"
S3_KEY = "english-tweets.csv"
FILENAME = S3_KEY

boto3.client('s3').download_file(S3_BUCKET, S3_KEY, FILENAME)

Then we'll load and clean the data.


In [5]:
import utils  # helper module alongside this notebook

# shuffle the rows, then clean the text in place
data = pd.read_csv(FILENAME).sample(frac=1).reset_index(drop=True)
utils.clean_data(data)

data.head()


Out[5]:
                                                text  sentiment
0                                  the price of fame          0
1  My company (toggle, who powers anderra) has ju...          1
2  the littlest prince- im rereading it, its been...          1
3   that hit makes me sad poor welker. hes a toug...          0
4                                   byebye everyone.          1
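
The utils module lives alongside this notebook. As a rough sketch, its clean_data helper might normalize the text in place along these lines (the actual implementation may differ):

def clean_data(data):
    """Illustrative sketch of an in-place tweet cleanup (hypothetical)."""
    # strip URLs and @-mentions, then trim whitespace
    data['text'] = (
        data['text']
        .str.replace(r'https?://\S+', '', regex=True)
        .str.replace(r'@\w+', '', regex=True)
        .str.strip()
    )
    # drop rows whose text was emptied by the cleanup
    data.drop(data.index[data['text'].str.len() == 0], inplace=True)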

Capture and Version Model Ingredients

We'll first capture metadata about our code, configuration, dataset, and environment using utilities from the verta library.


In [6]:
from verta.code import Notebook
from verta.configuration import Hyperparameters
from verta.dataset import S3
from verta.environment import Python

code_ver = Notebook()  # Notebook & git environment
config_ver = Hyperparameters({'n_iter': 20})
dataset_ver = S3("s3://{}/{}".format(S3_BUCKET, S3_KEY))
env_ver = Python()  # pip environment and Python version


Then, to log them, we'll use a ModelDB repository to prepare a commit.


In [7]:
repo = client.set_repository('Tweet Classification')
commit = repo.get_commit(branch='master')


set existing Repository: Tweet Classification from personal workspace

Now we'll add these versioned components to the commit and save it to ModelDB.


In [8]:
commit.update("notebooks/tweet-analysis", code_ver)
commit.update("config/hyperparams", config_ver)
commit.update("data/tweets", dataset_ver)
commit.update("env/python", env_ver)

commit.save("Initial model")

commit


Out[8]:
(Branch: master)
Commit e9f25d8206115119d202c62f540a60e6d988615e6c96e9c0701b67b8b5c2c9f9 containing:
config/hyperparams (Hyperparameters)
data/tweets (S3)
env/python (Python)
notebooks/tweet-analysis (Notebook)
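
Each component can be read back from the commit by its path; for example, to retrieve the hyperparameter configuration (a small sketch using the same commit object):

retrieved_config = commit.get("config/hyperparams")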

Train and Log Model

We'll use the pre-trained spaCy model we downloaded earlier...


In [9]:
nlp = spacy.load('en_core_web_sm')

...and fine-tune it with our dataset.


In [10]:
import training

training.train(nlp, data, n_iter=20)


Using 16000 examples (12800 training, 3200 evaluation)
Training the model...
LOSS 	  P  	  R  	  F  
16.369	0.729	0.707	0.717
0.360	0.744	0.729	0.736
0.105	0.746	0.734	0.740
0.089	0.751	0.739	0.745
0.076	0.759	0.734	0.746
0.066	0.751	0.730	0.740
0.057	0.747	0.733	0.740
0.046	0.742	0.721	0.731
0.042	0.744	0.722	0.733
0.035	0.741	0.719	0.730
0.031	0.742	0.709	0.725
0.027	0.737	0.715	0.726
0.023	0.733	0.712	0.723
0.022	0.735	0.721	0.728
0.021	0.737	0.712	0.725
0.019	0.742	0.712	0.726
0.016	0.747	0.722	0.734
0.018	0.745	0.720	0.732
0.015	0.744	0.719	0.731
0.014	0.739	0.719	0.729
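
The training module also lives alongside this notebook. As a rough sketch, a helper like training.train could fine-tune a classifier with spaCy 2.x's textcat pipe along these lines (the evaluation loop that produced the P/R/F columns above is omitted for brevity, and the real module may differ):

import random

from spacy.util import minibatch, compounding

def train(nlp, data, n_iter=20):
    # attach a text classifier to the pre-trained pipeline
    if 'textcat' not in nlp.pipe_names:
        textcat = nlp.create_pipe('textcat')
        nlp.add_pipe(textcat, last=True)
    else:
        textcat = nlp.get_pipe('textcat')
    textcat.add_label('POSITIVE')

    # pair each tweet with its label in spaCy's annotation format
    train_data = list(zip(
        data['text'],
        ({'cats': {'POSITIVE': bool(label)}} for label in data['sentiment']),
    ))

    # update only the classifier, leaving the other pipes frozen
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for _ in range(n_iter):
            random.shuffle(train_data)
            losses = {}
            for batch in minibatch(train_data, size=compounding(4.0, 32.0, 1.001)):
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer,
                           drop=0.2, losses=losses)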

Now that our model is good to go, we'll log it to ModelDB so our progress is never lost.

Using Verta's ModelDB Client, we'll create an Experiment Run to encapsulate our work, and log our model as an artifact.


In [11]:
run = client.set_experiment_run()

run.log_model(nlp)


created new ExperimentRun: Run 421821584660637939761
upload complete (custom_modules.zip)
upload complete (model.pkl)
upload complete (model_api.json)

And finally, we'll link the commit we created earlier to the Experiment Run to complete our logged model version.


In [12]:
run.log_commit(
    commit,
    {
        'notebook': "notebooks/tweet-analysis",
        'hyperparameters': "config/hyperparams",
        'training_data': "data/tweets",
        'python_env': "env/python",
    },
)

Now we've consolidated all the information we would need to reproduce this model later, or revisit the work we've done!
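
For instance, when revisiting this run later, the linked commit and the model artifact can be pulled back down; a sketch, assuming run.get_commit() returns the commit along with the key-to-path mapping we logged:

commit, key_paths = run.get_commit()
dataset_ver = commit.get(key_paths['training_data'])  # the versioned S3 blob
nlp = run.get_model()  # re-download the trained spaCy pipeline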

Proceed to the second notebook to see how problematic commits can be reverted.